library(e1071) # to understand skewness
library(dplyr)
library(stringr) # Used to rename the columns by removing the word team from the column header
library(VIM) # To understand NAs
library(caret)
library(mice)
library(MASS) # to use for robust Linear Regression.
# browse to the data
moneyball = read.csv('/Users/legs_jorge/Documents/Data Science Projects/MSDS_Northwestern/MSDS 411/Unit 01 Moneyball Baseball Problem/Data/moneyball.csv', header = T)
colnames(moneyball) <- str_replace_all(colnames(moneyball),"TEAM_","") %>%
tolower() # Fixing column names
The moneyball dataset has inspired many companies, teams, and organizations to understand and use the data they generate or gather. This project highlights pitfalls that those same organizations fall into when they skip the due diligence of preparing the data before modeling.
This paper will focus on:
1. Data Exploration
2. Data Transformation
3. Model Building
4. How to select the best model
Outliers can distort a model by exerting undue influence on its fit. Boxplots help identify them. We can also use the Cleveland dot plot to understand the outliers better. This technique plots the row number against the actual value to quickly reveal patterns of outliers. The plot makes it easy to check the raw data for errors, such as typos introduced during data collection. Points on the far right or far left are observed values considerably larger or smaller than the majority of the observations and require further investigation. Used together with the boxplot and histogram, this chart makes it easy to see where in the data the outliers occur.
par(mfrow = c(1, 3))
i = 2
while (i %in% c(2:17)) {
plot(moneyball[,i], moneyball$index, xlab = colnames(moneyball)[i] , ylab = "Index", main = paste("cleveland dotplot of ",colnames(moneyball)[i]))
boxplot(moneyball[,i], col = "#A71930", main = paste("Boxplot of ",colnames(moneyball)[i]))
hist(
moneyball[,i],
col = "#A71930",
xlab = colnames(moneyball)[i],
main = paste("Histogram of ",colnames(moneyball)[i])
)
i = i + 1
}
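The points that boxplot() draws beyond the whiskers can also be listed programmatically. The following is a minimal sketch using base R's boxplot.stats(); the helper name `flag_outliers` and the toy vector are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: return the positions of values that fall outside
# the boxplot whiskers (more than 1.5 * IQR beyond the hinges).
flag_outliers <- function(x) {
  out_values <- boxplot.stats(x)$out  # values boxplot() would draw as points
  which(x %in% out_values)            # positions of those values in x
}

# Toy data standing in for a single moneyball column
toy <- c(10, 12, 11, 13, 12, 11, 95)
flag_outliers(toy)
```

Applied to a real column, for example `flag_outliers(moneyball$pitching_h)`, the returned row positions can be inspected in the raw data before deciding whether a value is a typo or a legitimate extreme.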
It looks like the outliers are legitimate, so we will try a spatial sign transformation to deal with them.
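A spatial sign transformation centers and scales each predictor and then projects every observation onto the unit sphere, which limits how far any single extreme row can sit from the rest. The sketch below is a base R illustration under that definition; `spatial_sign` and the toy data are hypothetical names, and the caret package provides its own spatialSign() function for the same purpose.

```r
# Hypothetical helper: spatial sign transform of a numeric data frame or matrix.
spatial_sign <- function(x) {
  z <- scale(as.matrix(x))      # center and scale each column
  norms <- sqrt(rowSums(z^2))   # Euclidean length of each row
  norms[norms == 0] <- 1        # avoid dividing an all-zero row by zero
  z / norms                     # each row is divided by its own length
}

# Toy data with one extreme value in column a
toy <- data.frame(a = c(1, 2, 3, 100), b = c(5, 4, 6, 5))
ss <- spatial_sign(toy)
rowSums(ss^2)                   # every row now has squared length 1
```

Because the transform works on whole rows, it should be applied to the predictors as a group and not to the response.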
Now that step one is done, let’s look at step 2.
From the histograms above we can see that most of the data is not normally distributed, although a few variables roughly follow a normal distribution. Let’s use a QQ-plot to check each column for normality, alongside a histogram and a skewness value.
- If skewness is less than −1 or greater than +1, the distribution is highly skewed.
- If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
- If skewness is between −½ and +½, the distribution is approximately symmetric.
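The rule of thumb above can be written as a small helper. This is a sketch only: `moment_skewness` and `classify_skewness` are hypothetical names, and the moment-based estimate used here may differ slightly from the default estimator in e1071's skewness(), which applies a small-sample adjustment.

```r
# Hypothetical helper: moment-based sample skewness, ignoring NAs.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)   # second central moment
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / m2^1.5
}

# Hypothetical helper: apply the three thresholds listed above.
classify_skewness <- function(s) {
  if (abs(s) > 1) {
    "highly skewed"
  } else if (abs(s) >= 0.5) {
    "moderately skewed"
  } else {
    "approximately symmetric"
  }
}

classify_skewness(moment_skewness(c(1, 2, 3, 4, 5)))   # symmetric toy data
classify_skewness(moment_skewness(c(1, 1, 1, 1, 10)))  # long right tail
```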
par(mfrow = c(2, 2))
i = 2
while (i %in% c(2:17)) {
qqnorm(moneyball[,i], main = paste("QQ-Plot of ",colnames(moneyball)[i]));qqline(moneyball[,i], col = 2)
hist(
moneyball[,i],
col = "#A71930",
xlab = colnames(moneyball)[i],
main = paste0("Skewness = ", round(skewness(moneyball[,i], na.rm = TRUE), 3)) # na.rm avoids an NA title for columns with missing values
)
i = i + 1
}
We will need to try transformations to correct for skewness, with Box-Cox being the first choice.
R gives us many ways to understand the distribution of missing values (NAs) within the data. Let’s first calculate the percentage of missing values relative to the total number of observations.
NAPerc <-
sapply(moneyball, function(x)
(sum(is.na(x)) / length(x)) * 100) %>%
data.frame()
NAPerc$Column <- rownames(NAPerc)
colnames(NAPerc) <- c("NA_Perc", "Col_Name")
# Trying to understand the percentage of NAs per Column
NA_col <- subset(NAPerc, NA_Perc > 0) %>% arrange(desc(NA_Perc))
NA_col
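The same percentages can be obtained more compactly with colMeans() on the logical matrix returned by is.na(). The sketch below uses a hypothetical helper name, `na_percent`, and toy data in place of the moneyball data frame.

```r
# Hypothetical helper: percentage of missing values per column,
# keeping only columns that have at least one NA, largest first.
na_percent <- function(df) {
  pct <- colMeans(is.na(df)) * 100
  sort(pct[pct > 0], decreasing = TRUE)
}

toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, NA, 4), c = 1:4)
res <- na_percent(toy)
res
```

The result is a named numeric vector, so `names(res)` gives the affected columns directly.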
Let’s look at the pattern of missing data to try to get more insights. It’s clear that batting_hbp is going to be a problematic column with 92% of the data missing. Before we start the imputation or deleting variables, let’s try to understand why we have missing data.
Let’s use the mice package to help us understand how the NAs behave in the data. mice provides a handy function called md.pattern that summarizes the pattern of missing data. By looking at the pattern, we can hopefully get an idea of why the data could be missing.
md.pattern(moneyball) %>% data.frame()
Each row of the output is one missing-data pattern, and the row label gives the number of observations that follow it. There are 191 observations with no missing values, and 1295 observations that are complete except for the variable batting_hbp. The rightmost column shows the number of variables missing in each pattern; for example, the first row has no missing values, so its entry is 0. The last row counts the missing values for each variable: pitching_bb contains none, while batting_so contains 102. This table is helpful when deciding whether to drop observations whose number of missing variables exceeds a preset threshold.
After careful analysis, the decision is to keep batting_hbp. Because I want to transform it into a binary variable, I will set it aside until all the imputation is done.
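To make the idea of a missing-data pattern concrete, the sketch below counts patterns by hand in base R, using 1 for an observed value and 0 for a missing one. The helper name `missing_patterns` and the toy data are assumptions; md.pattern itself reports more detail than this.

```r
# Hypothetical helper: count how many rows share each observed/missing pattern.
missing_patterns <- function(df) {
  # One string per row, in column order: "1" = observed, "0" = missing
  pattern <- apply(!is.na(df), 1, function(r) paste(as.integer(r), collapse = ""))
  sort(table(pattern), decreasing = TRUE)
}

toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 3, 4), c = 1:4)
pat <- missing_patterns(toy)
pat
```

Here the pattern "111" marks fully observed rows, which corresponds to the first row of the md.pattern output.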
batting_hbp_bi <- if_else(is.na(moneyball$batting_hbp),0,1)
batting_hbp <- moneyball$batting_hbp
moneyball_trans <- subset(moneyball, select = -c(batting_hbp))
Let’s impute and treat the data for missing values before testing it for multicollinearity.
The mice package will help us with this task. Since we only have numeric values, mice will automatically choose PMM (Predictive Mean Matching) as the method. A great resource to understand this technique is found here.
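The imputation call itself is not shown in this write-up. As an illustration of the idea behind predictive mean matching, here is a simplified base R sketch for a single variable: it fits a regression on the observed cases, then fills each missing value with an observed value borrowed from one of the k cases whose predictions are closest. The helper `pmm_impute`, the choice of k, and the toy data are all assumptions; mice's own implementation additionally draws the regression parameters at random, and the commented mice call uses assumed values for `m` and `seed`.

```r
# Hypothetical helper: simplified predictive mean matching for one variable y
# given a fully observed predictor x.
pmm_impute <- function(y, x, k = 5) {
  d <- data.frame(y = y, x = x)
  obs <- !is.na(d$y)
  fit <- lm(y ~ x, data = d[obs, ])        # model fitted on observed cases
  pred <- predict(fit, newdata = d)        # predictions for every case
  k <- min(k, sum(obs))
  y_obs <- d$y[obs]
  for (i in which(!obs)) {
    # the k observed cases whose predicted value is closest to case i
    donors <- order(abs(pred[obs] - pred[i]))[seq_len(k)]
    d$y[i] <- y_obs[donors[sample.int(k, 1)]]  # borrow one donor's real value
  }
  d$y
}

set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
y[c(3, 12)] <- NA
imputed <- pmm_impute(y, x)

# With mice itself, the step might look like this (m and seed are assumed):
# imp <- mice(moneyball_trans, m = 1, method = "pmm", seed = 123)
# moneyball_imp <- complete(imp)
```

Because every imputed value is copied from a real observation, PMM never produces values outside the observed range, which suits count data such as these baseball statistics.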
Let’s add batting_hbp back into the data.
moneyball_imp$batting_hbp <- batting_hbp
moneyball_imp$batting_hbp_bi <- batting_hbp_bi
Let’s create a correlation matrix to understand how each independent variable relates to the dependent variable and to the other predictors. This correlation matrix will help us spot any violation of the assumptions needed to develop a robust OLS model, namely the absence of multicollinearity. The caret package can help find highly correlated pairs and even suggest which variable to remove.
The caret package offers findCorrelation(), which takes the correlation matrix as input and flags the fields causing multicollinearity based on a threshold, the cutoff parameter. It returns a vector of column indices that should be removed from the dataset to reduce pairwise correlation.
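Before removing anything, it can help to see which pairs exceed the threshold. The following base R sketch lists them; `high_cor_pairs` is a hypothetical helper, and the default of 0.9 mirrors findCorrelation's default cutoff.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# exceeds the cutoff, using only the upper triangle to avoid duplicates.
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df, use = "pairwise.complete.obs")
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data: b is almost a copy of a, c is unrelated
set.seed(42)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.05)
toy$c <- rnorm(50)
pairs_found <- high_cor_pairs(toy)
pairs_found
```

On the moneyball data, a listing like this would surface pairs such as batting_hr and pitching_hr, whose correlation of 0.969 appears in the matrix printed later.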
colnames(moneyball_imp)[findCorrelation(cor(moneyball_imp))]
[1] "batting_hr"
Let’s introduce new variables through transformation:
- batting_1B = batting_h - (batting_2b + batting_3b + batting_hr)
- free_bases_num = batting_hbp + batting_bb
- total_bases = batting_1B + 2 * batting_2b + 3 * batting_3b + 4 * batting_hr + batting_bb + batting_hbp + baserun_sb
- total_bases_allowed = pitching_bb + 4 * pitching_hr + pitching_h
- HR_over_OP = batting_hr - pitching_hr
- walks_over_OP = batting_bb - pitching_bb
- SO_over_OP = pitching_so - batting_so
moneyball_imp$batting_1B <- moneyball_imp$batting_h - (moneyball_imp$batting_2b + moneyball_imp$batting_3b + moneyball_imp$batting_hr)
moneyball_imp$free_bases_num <- if_else(is.na(moneyball_imp$batting_hbp),0,as.numeric(moneyball_imp$batting_hbp)) + moneyball_imp$batting_bb
moneyball_imp$total_bases <- moneyball_imp$batting_1B + 2 * moneyball_imp$batting_2b + 3 * moneyball_imp$batting_3b + 4 * moneyball_imp$batting_hr + moneyball_imp$batting_bb + if_else(is.na(moneyball_imp$batting_hbp),0,as.numeric(moneyball_imp$batting_hbp)) + moneyball_imp$baserun_sb
moneyball_imp$total_bases_allowed = moneyball_imp$pitching_bb + 4 * moneyball_imp$pitching_hr + moneyball_imp$pitching_h
moneyball_imp$HR_over_OP = moneyball_imp$batting_hr - moneyball_imp$pitching_hr
moneyball_imp$walks_over_OP = moneyball_imp$batting_bb - moneyball_imp$pitching_bb
moneyball_imp$SO_over_OP = moneyball_imp$pitching_so - moneyball_imp$batting_so
# make a list of predictors and format them
colnames(moneyball_imp)
[1] "index" "target_wins" "batting_h" "batting_2b" "batting_3b"
[6] "batting_hr" "batting_bb" "batting_so" "baserun_sb" "baserun_cs"
[11] "pitching_h" "pitching_hr" "pitching_bb" "pitching_so" "fielding_e"
[16] "fielding_dp" "batting_hbp" "batting_hbp_bi" "batting_1B" "free_bases_num"
[21] "total_bases" "total_bases_allowed" "HR_over_OP" "walks_over_OP" "SO_over_OP"
pred_list <-
"index + target_wins + batting_h + batting_2b + batting_3b + batting_hr +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp + batting_hbp_bi +
batting_1B + free_bases_num + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP"
# keep the new variables in a vector for testing later, in case they don't prove to be of any value.
new_var <- c("batting_1B","free_bases_num","total_bases","total_bases_allowed","HR_over_OP","walks_over_OP","SO_over_OP")
Now that we have imputed the data and created new variables, let’s look at the correlation matrix to understand the correlation between the variables and target_wins.
moneyball_imp <- subset(moneyball_imp, select = -c(batting_hbp))
cor(moneyball_imp)
index target_wins batting_h batting_2b batting_3b batting_hr batting_bb batting_so
index 1.000000000 -0.021056435 -0.017920241 0.011183013 -0.005814683 0.051481047 -0.02656724 0.08519647
target_wins -0.021056435 1.000000000 0.388767521 0.289103645 0.142608411 0.176153200 0.23255986 -0.03784054
batting_h -0.017920241 0.388767521 1.000000000 0.562849678 0.427696575 -0.006544685 -0.07246401 -0.42669216
batting_2b 0.011183013 0.289103645 0.562849678 1.000000000 -0.107305824 0.435397293 0.25572610 0.18629939
batting_3b -0.005814683 0.142608411 0.427696575 -0.107305824 1.000000000 -0.635566946 -0.28723584 -0.67142084
batting_hr 0.051481047 0.176153200 -0.006544685 0.435397293 -0.635566946 1.000000000 0.51373481 0.72695383
batting_bb -0.026567236 0.232559864 -0.072464013 0.255726103 -0.287235841 0.513734810 1.00000000 0.38595534
batting_so 0.085196474 -0.037840540 -0.426692156 0.186299393 -0.671420839 0.726953830 0.38595534 1.00000000
baserun_sb 0.031624516 0.104456513 0.133389886 -0.200831439 0.531847293 -0.500564937 -0.33797164 -0.30217480
baserun_cs -0.021998669 0.033689086 0.071021479 -0.301507441 0.616615872 -0.629918304 -0.34803753 -0.42579881
pitching_h 0.017103148 -0.109937054 0.302693709 0.023692188 0.194879411 -0.250145481 -0.44977762 -0.36289699
pitching_hr 0.050985897 0.189013735 0.072853119 0.454550818 -0.567836679 0.969371396 0.45955207 0.66970706
pitching_bb -0.015287513 0.124174536 0.094193027 0.178054204 -0.002224148 0.136927564 0.48936126 0.05309569
pitching_so 0.056360521 -0.074521077 -0.236077028 0.077734017 -0.263986375 0.194797938 -0.01148177 0.42152734
fielding_e -0.009233126 -0.176484759 0.264902478 -0.235150986 0.509778447 -0.587339098 -0.65597081 -0.58183930
fielding_dp 0.010177677 -0.065095861 0.053058417 0.301414025 -0.409214014 0.478501657 0.33853564 0.25922140
batting_hbp_bi 0.047332196 0.002610647 0.019594018 0.361922796 -0.265544426 0.392199209 0.10305838 0.39651793
batting_1B -0.047074417 0.217430135 0.827584756 0.087009889 0.600399234 -0.497294855 -0.35312165 -0.74207113
free_bases_num -0.019063695 0.228098279 -0.068377971 0.297591911 -0.316009005 0.553966941 0.99101046 0.42997516
total_bases 0.025173504 0.481052452 0.641416724 0.704060978 0.038577619 0.593742440 0.54428557 0.20865100
total_bases_allowed 0.023268954 -0.059959123 0.314205398 0.119290484 0.092039617 -0.062551344 -0.30004852 -0.23084278
HR_over_OP -0.000553440 -0.060991072 -0.322055891 -0.099453882 -0.243354524 0.074559388 0.19441460 0.19623665
walks_over_OP -0.004745951 0.052184113 -0.162824365 0.011599182 -0.231156161 0.266798215 0.27356493 0.25533657
SO_over_OP 0.019397099 -0.063151948 -0.046239500 -0.007736561 0.045697731 -0.149785463 -0.20615010 -0.03681741
baserun_sb baserun_cs pitching_h pitching_hr pitching_bb pitching_so fielding_e fielding_dp
index 0.03162452 -0.02199867 0.01710315 0.05098590 -0.015287513 0.056360521 -0.009233126 0.010177677
target_wins 0.10445651 0.03368909 -0.10993705 0.18901373 0.124174536 -0.074521077 -0.176484759 -0.065095861
batting_h 0.13338989 0.07102148 0.30269371 0.07285312 0.094193027 -0.236077028 0.264902478 0.053058417
batting_2b -0.20083144 -0.30150744 0.02369219 0.45455082 0.178054204 0.077734017 -0.235150986 0.301414025
batting_3b 0.53184729 0.61661587 0.19487941 -0.56783668 -0.002224148 -0.263986375 0.509778447 -0.409214014
batting_hr -0.50056494 -0.62991830 -0.25014548 0.96937140 0.136927564 0.194797938 -0.587339098 0.478501657
batting_bb -0.33797164 -0.34803753 -0.44977762 0.45955207 0.489361263 -0.011481766 -0.655970815 0.338535639
batting_so -0.30217480 -0.42579881 -0.36289699 0.66970706 0.053095691 0.421527336 -0.581839303 0.259221402
baserun_sb 1.00000000 0.81901539 0.17588136 -0.44732762 0.031892115 0.055307907 0.598724673 -0.602198238
baserun_cs 0.81901539 1.00000000 0.13387505 -0.59171178 -0.017686782 -0.021450144 0.553690108 -0.612892723
pitching_h 0.17588136 0.13387505 1.00000000 -0.14161276 0.320676162 0.268789756 0.667759010 0.039399815
pitching_hr -0.44732762 -0.59171178 -0.14161276 1.00000000 0.221937505 0.215006676 -0.493144466 0.467400014
pitching_bb 0.03189212 -0.01768678 0.32067616 0.22193750 1.000000000 0.488322635 -0.022837561 0.207786439
pitching_so 0.05530791 -0.02145014 0.26878976 0.21500668 0.488322635 1.000000000 -0.027229749 0.110776318
fielding_e 0.59872467 0.55369011 0.66775901 -0.49314447 -0.022837561 -0.027229749 1.000000000 -0.411305133
fielding_dp -0.60219824 -0.61289272 0.03939981 0.46740001 0.207786439 0.110776318 -0.411305133 1.000000000
batting_hbp_bi -0.13506950 -0.21317271 -0.06445004 0.35794984 -0.016906833 0.134963064 -0.185315470 0.104550628
batting_1B 0.34233016 0.35130817 0.40612014 -0.41549520 -0.022820326 -0.327258839 0.547816415 -0.185951668
free_bases_num -0.34815545 -0.36821249 -0.44800796 0.49652206 0.476195183 0.006845544 -0.665319984 0.344178637
total_bases 0.02325019 -0.14947902 -0.09016127 0.62224360 0.354242024 -0.010818075 -0.236349233 0.217814783
total_bases_allowed 0.09785104 0.02757418 0.97499650 0.05669475 0.459579945 0.350267877 0.557252830 0.139943309
HR_over_OP -0.19123177 -0.12375950 -0.42822141 -0.17264012 -0.351988418 -0.091755821 -0.353210656 0.021245563
walks_over_OP -0.31004785 -0.26355189 -0.71949139 0.12897043 -0.704942270 -0.547928892 -0.508313405 0.046155397
SO_over_OP 0.21244403 0.18983408 0.47814645 -0.09881476 0.511518231 0.890681337 0.261695116 -0.007882676
batting_hbp_bi batting_1B free_bases_num total_bases total_bases_allowed HR_over_OP walks_over_OP
index 0.047332196 -0.04707442 -0.019063695 0.02517350 0.023268954 -0.00055344 -0.004745951
target_wins 0.002610647 0.21743014 0.228098279 0.48105245 -0.059959123 -0.06099107 0.052184113
batting_h 0.019594018 0.82758476 -0.068377971 0.64141672 0.314205398 -0.32205589 -0.162824365
batting_2b 0.361922796 0.08700989 0.297591911 0.70406098 0.119290484 -0.09945388 0.011599182
batting_3b -0.265544426 0.60039923 -0.316009005 0.03857762 0.092039617 -0.24335452 -0.231156161
batting_hr 0.392199209 -0.49729485 0.553966941 0.59374244 -0.062551344 0.07455939 0.266798215
batting_bb 0.103058382 -0.35312165 0.991010459 0.54428557 -0.300048525 0.19441460 0.273564933
batting_so 0.396517931 -0.74207113 0.429975162 0.20865100 -0.230842784 0.19623665 0.255336573
baserun_sb -0.135069502 0.34233016 -0.348155451 0.02325019 0.097851040 -0.19123177 -0.310047852
baserun_cs -0.213172712 0.35130817 -0.368212490 -0.14947902 0.027574182 -0.12375950 -0.263551885
pitching_h -0.064450039 0.40612014 -0.448007961 -0.09016127 0.974996503 -0.42822141 -0.719491389
pitching_hr 0.357949841 -0.41549520 0.496522065 0.62224360 0.056694753 -0.17264012 0.128970430
pitching_bb -0.016906833 -0.02282033 0.476195183 0.35424202 0.459579945 -0.35198842 -0.704942270
pitching_so 0.134963064 -0.32725884 0.006845544 -0.01081807 0.350267877 -0.09175582 -0.547928892
fielding_e -0.185315470 0.54781641 -0.665319984 -0.23634923 0.557252830 -0.35321066 -0.508313405
fielding_dp 0.104550628 -0.18595167 0.344178637 0.21781478 0.139943309 0.02124556 0.046155397
batting_hbp_bi 1.000000000 -0.23605172 0.231848863 0.29527604 -0.003909755 0.11953125 0.102464739
batting_1B -0.236051718 1.00000000 -0.376395883 0.17657688 0.318513233 -0.30736736 -0.262024813
free_bases_num 0.231848863 -0.37639588 1.000000000 0.57120065 -0.293643548 0.20565630 0.280775130
total_bases 0.295276040 0.17657688 0.571200648 1.00000000 0.057905416 -0.14529452 0.051960291
total_bases_allowed -0.003909755 0.31851323 -0.293643548 0.05790542 1.000000000 -0.48106409 -0.750919119
HR_over_OP 0.119531251 -0.30736736 0.205656303 -0.14529452 -0.481064087 1.00000000 0.546339879
walks_over_OP 0.102464739 -0.26202481 0.280775130 0.05196029 -0.750919119 0.54633988 1.000000000
SO_over_OP -0.050061610 0.01139092 -0.208022326 -0.11652793 0.501731531 -0.19949844 -0.731836247
SO_over_OP
index 0.019397099
target_wins -0.063151948
batting_h -0.046239500
batting_2b -0.007736561
batting_3b 0.045697731
batting_hr -0.149785463
batting_bb -0.206150098
batting_so -0.036817410
baserun_sb 0.212444030
baserun_cs 0.189834080
pitching_h 0.478146452
pitching_hr -0.098814763
pitching_bb 0.511518231
pitching_so 0.890681337
fielding_e 0.261695116
fielding_dp -0.007882676
batting_hbp_bi -0.050061610
batting_1B 0.011390918
free_bases_num -0.208022326
total_bases -0.116527933
total_bases_allowed 0.501731531
HR_over_OP -0.199498439
walks_over_OP -0.731836247
SO_over_OP 1.000000000
Let’s fit a model with every variable to establish a baseline.
str(moneyball_imp)
'data.frame': 2276 obs. of 24 variables:
$ index : int 1 2 3 4 5 6 7 8 11 12 ...
$ target_wins : int 39 70 86 70 82 75 80 85 86 76 ...
$ batting_h : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
$ batting_2b : int 194 219 232 209 186 200 179 171 197 213 ...
$ batting_3b : int 39 22 35 38 27 36 54 37 40 18 ...
$ batting_hr : int 13 190 137 96 102 92 122 115 114 96 ...
$ batting_bb : int 143 685 602 451 472 443 525 456 447 441 ...
$ batting_so : int 842 1075 917 922 920 973 1062 1027 922 827 ...
$ baserun_sb : int 341 37 46 43 49 107 80 40 69 72 ...
$ baserun_cs : int 193 28 27 30 39 59 54 36 27 34 ...
$ pitching_h : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
$ pitching_hr : int 84 191 137 97 102 92 122 116 114 96 ...
$ pitching_bb : int 927 689 602 454 472 443 525 459 447 441 ...
$ pitching_so : int 5456 1082 917 928 920 973 1062 1033 922 827 ...
$ fielding_e : int 1011 193 175 164 138 123 136 112 127 131 ...
$ fielding_dp : int 178 155 153 156 168 149 186 136 169 159 ...
$ batting_hbp_bi : num 0 0 0 0 0 0 0 0 0 0 ...
$ batting_1B : int 1199 908 973 1044 982 951 889 950 1040 944 ...
$ free_bases_num : num 143 685 602 451 472 443 525 456 447 441 ...
$ total_bases : num 2240 2894 2738 2454 2364 ...
$ total_bases_allowed: num 10627 2800 2527 2238 2177 ...
$ HR_over_OP : int -71 -1 0 -1 0 0 0 -1 0 0 ...
$ walks_over_OP : int -784 -4 0 -3 0 0 0 -3 0 0 ...
$ SO_over_OP : int 4614 7 0 6 0 0 0 6 0 0 ...
base_model_all <- lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_hr + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp + batting_hbp_bi + batting_1B + free_bases_num + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP, data = moneyball_imp)
par(mfrow=c(2,2))
plot(base_model_all)
summary(base_model_all)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_hr + batting_bb + batting_so + baserun_sb + baserun_cs +
pitching_h + pitching_hr + pitching_bb + pitching_so + fielding_e +
fielding_dp + batting_hbp + batting_hbp_bi + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = moneyball_imp)
Residuals:
Min 1Q Median 3Q Max
-19.8708 -5.6564 -0.0599 5.2545 22.9274
Coefficients: (8 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.28826 19.67842 3.064 0.00253 **
batting_h 1.91348 2.76139 0.693 0.48927
batting_2b 0.02639 0.03029 0.871 0.38484
batting_3b -0.10118 0.07751 -1.305 0.19348
batting_hr -4.84371 10.50851 -0.461 0.64542
batting_bb -4.45969 3.63624 -1.226 0.22167
batting_so 0.34196 2.59876 0.132 0.89546
baserun_sb 0.03304 0.02867 1.152 0.25071
baserun_cs -0.01104 0.07143 -0.155 0.87730
pitching_h -1.89096 2.76095 -0.685 0.49432
pitching_hr 4.93043 10.50664 0.469 0.63946
pitching_bb 4.51089 3.63372 1.241 0.21612
pitching_so -0.37364 2.59705 -0.144 0.88577
fielding_e -0.17204 0.04140 -4.155 5.08e-05 ***
fielding_dp -0.10819 0.03654 -2.961 0.00349 **
batting_hbp 0.08247 0.04960 1.663 0.09815 .
batting_hbp_bi NA NA NA NA
batting_1B NA NA NA NA
free_bases_num NA NA NA NA
total_bases NA NA NA NA
total_bases_allowed NA NA NA NA
HR_over_OP NA NA NA NA
walks_over_OP NA NA NA NA
SO_over_OP NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.467 on 175 degrees of freedom
(2085 observations deleted due to missingness)
Multiple R-squared: 0.5501, Adjusted R-squared: 0.5116
F-statistic: 14.27 on 15 and 175 DF, p-value: < 2.2e-16
mse <- function(sm)
mean(sm$residuals^2)
paste('MSE equal ', mse(base_model_all))
[1] "MSE equal 65.6852879651226"
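Although R-squared and adjusted R-squared are comparatively high, the summary shows that this model dropped 2085 observations because of missing values, leaving only 191 rows, and that eight coefficients could not be estimated because of singularities. Let’s set the new variables aside and build a model without them.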
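The row loss is easy to overlook because lm() removes incomplete rows without raising an error. A minimal sketch on toy data (all names and values here are illustrative assumptions) shows how to compare the rows supplied with the rows actually used:

```r
# Toy data in which one predictor has missing values
toy <- data.frame(
  y  = c(10, 12, 15, 11, 18, 20, 17, 14),
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(NA, 3, NA, 5, 1, NA, 2, 4)
)

fit <- lm(y ~ x1 + x2, data = toy)

nrow(toy)                  # rows supplied to lm()
nobs(fit)                  # rows actually used after incomplete rows are dropped
sum(complete.cases(toy))   # matches nobs(fit)
```

Running the same `nobs()` check on a fitted moneyball model would flag the problem before any performance metric is interpreted.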
moneyball_orig <- moneyball_imp[,1:17]
base_model_orig <-
lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_hr + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp, data = moneyball_orig)
par(mfrow = c(2, 2))
plot(base_model_orig)
summary(base_model_orig)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_hr + batting_bb + batting_so + baserun_sb + baserun_cs +
pitching_h + pitching_hr + pitching_bb + pitching_so + fielding_e +
fielding_dp, data = moneyball_orig)
Residuals:
Min 1Q Median 3Q Max
-46.207 -8.319 0.073 8.288 53.476
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.7456730 5.1101841 6.995 3.48e-12 ***
batting_h 0.0431762 0.0035717 12.088 < 2e-16 ***
batting_2b -0.0189845 0.0088468 -2.146 0.03199 *
batting_3b 0.0328935 0.0165608 1.986 0.04713 *
batting_hr 0.0696203 0.0264228 2.635 0.00847 **
batting_bb 0.0120441 0.0055827 2.157 0.03108 *
batting_so -0.0158916 0.0024545 -6.474 1.16e-10 ***
baserun_sb 0.0523104 0.0052785 9.910 < 2e-16 ***
baserun_cs -0.0092764 0.0104461 -0.888 0.37462
pitching_h 0.0014497 0.0003833 3.782 0.00016 ***
pitching_hr 0.0107640 0.0234254 0.459 0.64592
pitching_bb -0.0025039 0.0039757 -0.630 0.52889
pitching_so 0.0014824 0.0008894 1.667 0.09571 .
fielding_e -0.0410916 0.0026638 -15.426 < 2e-16 ***
fielding_dp -0.1187948 0.0125159 -9.492 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.59 on 2261 degrees of freedom
Multiple R-squared: 0.3649, Adjusted R-squared: 0.361
F-statistic: 92.8 on 14 and 2261 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(base_model_orig))
[1] "MSE equal 157.509836420181"
This model looks reasonable from a performance point of view (R-squared), and it now uses all 2276 observations, but the spread of the residuals does not inspire confidence.
Let’s build another model that includes only the variables with low p-values.
base_model_lp <-
lm(target_wins ~ batting_h + batting_2b + batting_hr + batting_bb + batting_so + baserun_sb + pitching_h + pitching_so + fielding_e + fielding_dp, data = moneyball_orig)
par(mfrow = c(2, 2))
plot(base_model_lp)
summary(base_model_lp)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_hr +
batting_bb + batting_so + baserun_sb + pitching_h + pitching_so +
fielding_e + fielding_dp, data = moneyball_orig)
Residuals:
Min 1Q Median 3Q Max
-46.974 -8.351 0.134 8.278 52.051
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.2536191 4.9339802 6.942 5.02e-12 ***
batting_h 0.0458731 0.0033148 13.839 < 2e-16 ***
batting_2b -0.0202481 0.0087922 -2.303 0.02137 *
batting_hr 0.0769585 0.0088515 8.694 < 2e-16 ***
batting_bb 0.0097206 0.0030308 3.207 0.00136 **
batting_so -0.0160263 0.0023549 -6.806 1.28e-11 ***
baserun_sb 0.0509706 0.0041882 12.170 < 2e-16 ***
pitching_h 0.0013127 0.0003368 3.897 0.00010 ***
pitching_so 0.0010922 0.0006662 1.640 0.10124
fielding_e -0.0409861 0.0026598 -15.410 < 2e-16 ***
fielding_dp -0.1191827 0.0123487 -9.651 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.59 on 2265 degrees of freedom
Multiple R-squared: 0.3636, Adjusted R-squared: 0.3608
F-statistic: 129.4 on 10 and 2265 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(base_model_lp))
[1] "MSE equal 157.839856479662"
Let’s remove the variables causing multicollinearity using findCorrelation().
to_rm <- colnames(cor(moneyball_imp)[,findCorrelation(cor(moneyball_imp))])
to_rm
[1] "batting_hr" "free_bases_num" "pitching_h"
base_model_noCol <-
lm(target_wins ~ batting_h + batting_2b + batting_bb + batting_so + baserun_sb + pitching_so + fielding_e + fielding_dp, data = moneyball_orig)
par(mfrow = c(2, 2))
plot(base_model_noCol)
summary(base_model_noCol)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_bb +
batting_so + baserun_sb + pitching_so + fielding_e + fielding_dp,
data = moneyball_orig)
Residuals:
Min 1Q Median 3Q Max
-47.479 -8.460 0.291 8.567 46.090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.9921051 4.2605180 2.111 0.034919 *
batting_h 0.0579261 0.0031329 18.489 < 2e-16 ***
batting_2b -0.0179472 0.0089646 -2.002 0.045403 *
batting_bb 0.0157299 0.0029952 5.252 1.65e-07 ***
batting_so -0.0027274 0.0017083 -1.597 0.110494
baserun_sb 0.0357246 0.0039159 9.123 < 2e-16 ***
pitching_so 0.0019710 0.0005953 3.311 0.000945 ***
fielding_e -0.0334027 0.0021195 -15.759 < 2e-16 ***
fielding_dp -0.0923827 0.0120924 -7.640 3.19e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.85 on 2267 degrees of freedom
Multiple R-squared: 0.3371, Adjusted R-squared: 0.3348
F-statistic: 144.1 on 8 and 2267 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(base_model_noCol))
[1] "MSE equal 164.403742392932"
Though the R-squared value went down, there are some improvements in the Cook’s distance chart. Now let’s use the caret package to apply the transformations we discussed earlier in the exploration phase.
trans <- preProcess(moneyball_imp, method = c("BoxCox"))
transformed <- predict(trans, moneyball_imp)
head(transformed)
index target_wins batting_h batting_2b batting_3b batting_hr batting_bb
1 0.0000000 39 0.7691708 37.64575 39 13 143
2 0.8921497 70 0.7691645 40.61141 22 190 685
3 1.6538133 86 0.7691669 42.09981 35 137 602
4 2.3414512 70 0.7691675 39.44230 38 96 451
5 2.9788133 82 0.7691617 36.66490 27 102 472
6 3.5787773 75 0.7691604 38.37081 36 92 443
batting_so baserun_sb baserun_cs pitching_h pitching_hr pitching_bb pitching_so
1 842 341 193 0.5000000 84 927 5456
2 1075 37 28 0.4999997 191 689 1082
3 917 46 27 0.4999997 137 602 917
4 922 43 30 0.4999997 97 454 928
5 920 49 39 0.4999997 102 472 920
6 973 107 59 0.4999997 92 443 973
fielding_e fielding_dp batting_hbp_bi batting_1B free_bases_num total_bases
1 1.108916 2491.398 0 0.4999997 143 8731.220
2 1.101367 1996.525 0 0.4999994 685 11873.712
3 1.100469 1955.454 0 0.4999995 602 11109.802
4 1.099829 2017.181 0 0.4999995 451 9741.618
5 1.097933 2271.200 0 0.4999995 472 9314.443
6 1.096495 1874.275 0 0.4999994 443 9375.948
total_bases_allowed HR_over_OP walks_over_OP SO_over_OP
1 0.5263158 -71 -784 4614
2 0.5263156 -1 -4 7
3 0.5263156 0 0 0
4 0.5263156 -1 -3 6
5 0.5263155 0 0 0
6 0.5263155 0 0 0
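To see what the Box-Cox step does to an individual column, the transform can be written out directly. In the sketch below, `box_cox` is a hypothetical helper, and the lambda of -1.3 is an assumption inferred from the printed values above; it is not read from the preProcess object.

```r
# Hypothetical helper: Box-Cox transform for strictly positive values.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# First two raw batting_h values from the str() output earlier
bc <- box_cox(c(1445, 1339), lambda = -1.3)
print(bc, digits = 7)
```

Under that assumed lambda, the results agree with the first two transformed batting_h values printed above (0.7691708 and 0.7691645) to about six decimal places. A strongly negative lambda squeezes the column into a very narrow range, which may be one reason for the very large batting_h coefficient in the model that follows.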
trans_model_all <-
lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B + free_bases_num + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP, data = transformed)
par(mfrow = c(2, 2))
plot(trans_model_all)
summary(trans_model_all)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_h +
pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp +
batting_1B + free_bases_num + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP + SO_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-53.068 -7.876 -0.019 8.154 54.708
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.094e+05 1.205e+05 -0.908 0.36383
batting_h 1.452e+05 1.566e+05 0.927 0.35409
batting_2b -3.757e-01 1.154e-01 -3.256 0.00114 **
batting_3b 3.242e-02 2.369e-02 1.368 0.17129
batting_bb 1.433e-01 1.837e-02 7.802 9.27e-15 ***
batting_so -1.766e-02 2.567e-03 -6.881 7.68e-12 ***
baserun_sb 1.227e-03 9.393e-03 0.131 0.89610
baserun_cs 9.062e-03 1.056e-02 0.858 0.39074
pitching_h NA NA NA NA
pitching_hr -4.568e-02 2.612e-02 -1.749 0.08045 .
pitching_bb -7.883e-03 3.459e-03 -2.279 0.02277 *
pitching_so 3.592e-03 8.920e-04 4.027 5.85e-05 ***
fielding_e -1.985e+03 1.311e+02 -15.137 < 2e-16 ***
fielding_dp -6.610e-03 6.394e-04 -10.337 < 2e-16 ***
batting_1B NA NA NA NA
free_bases_num -1.455e-01 2.066e-02 -7.043 2.49e-12 ***
total_bases 6.612e-03 1.590e-03 4.158 3.33e-05 ***
total_bases_allowed NA NA NA NA
HR_over_OP -4.014e-02 3.440e-02 -1.167 0.24340
walks_over_OP NA NA NA NA
SO_over_OP NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.67 on 2260 degrees of freedom
Multiple R-squared: 0.3571, Adjusted R-squared: 0.3529
F-statistic: 83.7 on 15 and 2260 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(trans_model_all))
[1] "MSE equal 159.445223911529"
par(mfrow = c(1, 3))
i = 2
while (i %in% c(2:17)) {
plot(transformed[,i], transformed$index, xlab = colnames(transformed)[i] , ylab = "Index", main = paste("cleveland dotplot of ",colnames(transformed)[i]))
boxplot(transformed[,i], col = "#A71930", main = paste("Boxplot of ",colnames(transformed)[i]))
hist(
transformed[,i],
col = "#A71930",
xlab = colnames(transformed)[i],
main = paste("Histogram of ",colnames(transformed)[i])
)
i = i + 1
}
Looking at Cook’s distance, it’s clear that we have influential observations, but the other charts look right where they should be.
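The influential observations seen in the chart can also be listed by row. The sketch below uses base R's cooks.distance() on the built-in mtcars data as a stand-in; the 4/n cutoff is a common rule of thumb and an assumption here, not a threshold taken from the original analysis.

```r
# Stand-in model on a built-in dataset; in this analysis the fitted object
# would be one of the moneyball models, for example trans_model_all.
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd <- cooks.distance(fit)        # one distance per observation
threshold <- 4 / nobs(fit)       # rule-of-thumb cutoff (assumed)
influential <- cd[cd > threshold]
sort(influential, decreasing = TRUE)
```

The names of the returned vector identify the rows worth inspecting in the raw data.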
Let’s try a stepwise approach, starting with selection in both directions.
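Stepwise selection compares candidate models by AIC, and for linear models the value it reports is computed from the residual sum of squares up to an additive constant. The sketch below reproduces that calculation in base R; `aic_from_rss` is a hypothetical helper, and mtcars stands in for the moneyball data.

```r
# Hypothetical helper: AIC for a linear model, up to an additive constant,
# computed as n * log(RSS / n) + 2 * (number of estimated coefficients).
aic_from_rss <- function(fit) {
  n   <- nobs(fit)
  rss <- sum(residuals(fit)^2)
  edf <- n - df.residual(fit)    # number of estimated coefficients
  n * log(rss / n) + 2 * edf
}

fit_full  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_small <- lm(mpg ~ wt + hp, data = mtcars)

c(full = aic_from_rss(fit_full), small = aic_from_rss(fit_small))
extractAIC(fit_full)             # base R's own value: (edf, AIC)
```

A lower value is preferred, so at each step the procedure drops or adds the term that reduces this number the most and stops when no single change improves it.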
stepwise_base_model_bd <- stepAIC(trans_model_all, direction = "both")
Start: AIC=-991.49
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Step: AIC=-991.49
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP
Step: AIC=-991.49
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
- batting_1B 1 0.017 1447.9 -993.47
- baserun_cs 1 0.768 1448.6 -992.29
- pitching_hr 1 0.786 1448.7 -992.26
- baserun_sb 1 0.990 1448.9 -991.94
- batting_3b 1 1.038 1448.9 -991.86
- batting_2b 1 1.228 1449.1 -991.56
<none> 1447.9 -991.49
- total_bases_allowed 1 1.301 1449.2 -991.45
- batting_h 1 1.672 1449.5 -990.87
- pitching_bb 1 2.362 1450.2 -989.78
- HR_over_OP 1 2.569 1450.4 -989.46
- total_bases 1 6.092 1454.0 -983.94
- pitching_h 1 7.152 1455.0 -982.28
- pitching_so 1 14.666 1462.5 -970.56
- free_bases_num 1 22.046 1469.9 -959.10
- batting_bb 1 26.489 1474.4 -952.23
- batting_so 1 42.066 1489.9 -928.31
- fielding_dp 1 68.348 1516.2 -888.51
- fielding_e 1 120.742 1568.6 -811.19
Step: AIC=-993.47
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + free_bases_num +
total_bases + total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
- baserun_cs 1 0.780 1448.7 -994.24
- pitching_hr 1 0.973 1448.9 -993.94
- baserun_sb 1 1.061 1449.0 -993.80
<none> 1447.9 -993.47
- batting_3b 1 1.979 1449.9 -992.36
- total_bases_allowed 1 2.038 1449.9 -992.27
- pitching_bb 1 2.378 1450.3 -991.73
+ batting_1B 1 0.017 1447.9 -991.49
- HR_over_OP 1 2.872 1450.8 -990.96
- batting_2b 1 2.911 1450.8 -990.90
- batting_h 1 4.817 1452.7 -987.91
- total_bases 1 6.238 1454.1 -985.68
- pitching_h 1 10.187 1458.1 -979.51
- pitching_so 1 15.166 1463.0 -971.75
- free_bases_num 1 24.275 1472.2 -957.62
- batting_bb 1 29.578 1477.5 -949.44
- batting_so 1 42.122 1490.0 -930.20
- fielding_dp 1 69.730 1517.6 -888.41
- fielding_e 1 121.012 1568.9 -812.78
Step: AIC=-994.24
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + pitching_h + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + free_bases_num +
total_bases + total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
- pitching_hr 1 0.903 1449.6 -994.82
<none> 1448.7 -994.24
- total_bases_allowed 1 1.657 1450.3 -993.64
+ baserun_cs 1 0.780 1447.9 -993.47
- baserun_sb 1 2.039 1450.7 -993.04
- batting_3b 1 2.356 1451.0 -992.54
- pitching_bb 1 2.363 1451.0 -992.53
+ batting_1B 1 0.028 1448.6 -992.29
- HR_over_OP 1 2.705 1451.4 -992.00
- batting_2b 1 2.983 1451.7 -991.56
- batting_h 1 4.817 1453.5 -988.69
- total_bases 1 6.114 1454.8 -986.66
- pitching_h 1 9.596 1458.3 -981.22
- pitching_so 1 15.573 1464.2 -971.91
- free_bases_num 1 24.033 1472.7 -958.79
- batting_bb 1 29.558 1478.2 -950.27
- batting_so 1 42.228 1490.9 -930.85
- fielding_dp 1 71.951 1520.6 -885.92
- fielding_e 1 120.362 1569.0 -814.59
Step: AIC=-994.82
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + pitching_h + pitching_bb + pitching_so +
fielding_e + fielding_dp + free_bases_num + total_bases +
total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
<none> 1449.6 -994.82
- total_bases_allowed 1 1.421 1451.0 -994.60
+ pitching_hr 1 0.903 1448.7 -994.24
+ baserun_cs 1 0.709 1448.9 -993.94
- HR_over_OP 1 1.859 1451.4 -993.91
- pitching_bb 1 2.025 1451.6 -993.65
- batting_2b 1 2.113 1451.7 -993.51
+ batting_1B 1 0.160 1449.4 -993.08
- batting_3b 1 7.911 1457.5 -984.44
- pitching_h 1 9.605 1459.2 -981.79
- total_bases 1 11.184 1460.8 -979.33
- baserun_sb 1 14.380 1464.0 -974.36
- pitching_so 1 15.504 1465.1 -972.61
- batting_h 1 15.586 1465.2 -972.48
- free_bases_num 1 23.876 1473.5 -959.64
- batting_bb 1 28.848 1478.4 -951.97
- batting_so 1 51.216 1500.8 -917.80
- fielding_dp 1 71.451 1521.0 -887.32
- fielding_e 1 120.809 1570.4 -814.63
par(mfrow = c(2, 2))
plot(stepwise_base_model_bd)
summary(stepwise_base_model_bd)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + pitching_h + pitching_bb +
pitching_so + fielding_e + fielding_dp + free_bases_num +
total_bases + total_bases_allowed + HR_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-3.7578 -0.5110 -0.0044 0.5140 3.2705
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.900e-11 1.679e-02 0.000 1.000000
batting_h 2.724e-01 5.525e-02 4.929 8.85e-07 ***
batting_2b -5.554e-02 3.060e-02 -1.815 0.069640 .
batting_3b 1.149e-01 3.272e-02 3.512 0.000453 ***
batting_bb 9.685e-01 1.444e-01 6.706 2.51e-11 ***
batting_so -3.603e-01 4.032e-02 -8.936 < 2e-16 ***
baserun_sb 1.612e-01 3.404e-02 4.735 2.33e-06 ***
pitching_h -2.747e-01 7.099e-02 -3.870 0.000112 ***
pitching_bb -6.647e-02 3.741e-02 -1.777 0.075761 .
pitching_so 1.534e-01 3.120e-02 4.917 9.44e-07 ***
fielding_e -5.182e-01 3.776e-02 -13.724 < 2e-16 ***
fielding_dp -2.421e-01 2.294e-02 -10.554 < 2e-16 ***
free_bases_num -9.498e-01 1.557e-01 -6.101 1.23e-09 ***
total_bases 3.222e-01 7.715e-02 4.176 3.08e-05 ***
total_bases_allowed 8.224e-02 5.526e-02 1.488 0.136825
HR_over_OP -4.188e-02 2.460e-02 -1.702 0.088816 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8009 on 2260 degrees of freedom
Multiple R-squared: 0.3628, Adjusted R-squared: 0.3586
F-statistic: 85.79 on 15 and 2260 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_bd))
[1] "MSE equal 0.636893215447998"
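The MSE and AIC values reported above are tied together through the residual sum of squares. With n = 2276 observations (2260 residual degrees of freedom plus 16 estimated coefficients), RSS / n = 1449.6 / 2276 ≈ 0.6369, which matches the printed MSE, and n * log(RSS / n) + 2 * 16 ≈ -994.8, which matches the final AIC. The sketch below checks the same relationships on built-in data; mse_sketch is a hypothetical stand-in for the mse() helper used in this paper, assuming that helper returns the mean squared residual.

```r
# Hypothetical helper, assumed to mirror the paper's mse(): mean squared residual
mse_sketch <- function(model) mean(residuals(model)^2)

# Stand-in model on built-in data (NOT the moneyball data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)               # number of observations
p   <- length(coef(fit))       # estimated coefficients, intercept included
rss <- sum(residuals(fit)^2)   # residual sum of squares

mse_manual <- rss / n
aic_manual <- n * log(rss / n) + 2 * p   # the form of AIC printed by stepAIC
aic_step   <- extractAIC(fit)[2]         # what step()/stepAIC() compute internally

print(c(mse = mse_manual, aic_manual = aic_manual, aic_step = aic_step))
```

This AIC drops the additive constant that AIC() includes, so the two functions give different numbers for the same model; only differences between candidate models matter for the stepwise search.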
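Next, the forward direction. Because the search starts from the full model and no scope is supplied, there are no terms left to add, so stepAIC returns the starting model unchanged.
stepwise_base_model_fw <- stepAIC(trans_model_all, direction = "forward")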
Start: AIC=-991.49
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
par(mfrow = c(2, 2))
plot(stepwise_base_model_fw)
summary(stepwise_base_model_fw)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_h +
pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp +
batting_1B + free_bases_num + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP + SO_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-3.7240 -0.5110 -0.0064 0.5072 3.2284
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.167e-11 1.679e-02 0.000 1.000000
batting_h 2.289e-01 1.418e-01 1.614 0.106559
batting_2b -9.290e-02 6.714e-02 -1.384 0.166575
batting_3b 7.033e-02 5.528e-02 1.272 0.203418
batting_bb 9.937e-01 1.546e-01 6.426 1.59e-10 ***
batting_so -3.439e-01 4.247e-02 -8.098 9.05e-16 ***
baserun_sb 7.878e-02 6.342e-02 1.242 0.214247
baserun_cs 3.853e-02 3.521e-02 1.094 0.273929
pitching_h -2.789e-01 8.353e-02 -3.339 0.000854 ***
pitching_hr -1.385e-01 1.251e-01 -1.107 0.268323
pitching_bb -7.248e-02 3.778e-02 -1.919 0.055141 .
pitching_so 1.511e-01 3.161e-02 4.781 1.85e-06 ***
fielding_e -5.321e-01 3.878e-02 -13.719 < 2e-16 ***
fielding_dp -2.401e-01 2.326e-02 -10.322 < 2e-16 ***
batting_1B -2.097e-02 1.296e-01 -0.162 0.871526
free_bases_num -1.036e+00 1.767e-01 -5.862 5.24e-09 ***
total_bases 4.946e-01 1.605e-01 3.082 0.002083 **
total_bases_allowed 9.507e-02 6.676e-02 1.424 0.154599
HR_over_OP -7.378e-02 3.687e-02 -2.001 0.045487 *
walks_over_OP NA NA NA NA
SO_over_OP NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8009 on 2257 degrees of freedom
Multiple R-squared: 0.3636, Adjusted R-squared: 0.3585
F-statistic: 71.63 on 18 and 2257 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_fw))
[1] "MSE equal 0.636146688776401"
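Forward selection is usually started from a small model and given an upper scope to grow toward. A minimal sketch of that setup follows, using the built-in mtcars data as a stand-in; the variable names and formulas are illustrative assumptions, not the moneyball specification.

```r
library(MASS)

# Stand-in models on built-in data (NOT the moneyball data)
null_model <- lm(mpg ~ 1, data = mtcars)                      # intercept only
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # largest model allowed

# Forward search: start from the null model and add terms up to the full formula
forward_fit <- stepAIC(
  null_model,
  scope = list(lower = ~ 1, upper = formula(full_model)),
  direction = "forward",
  trace = FALSE   # suppress the step-by-step printout
)

print(formula(forward_fit))
```

Applied to this paper's objects, the analogous call would start from an intercept-only fit on transformed and pass formula(trans_model_all) as the upper scope.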
Finally, the backward direction.
stepwise_base_model_bw <- stepAIC(trans_model_all, direction = "backward")
Start: AIC=-991.49
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Step: AIC=-991.49
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP
Step: AIC=-991.49
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_1B +
free_bases_num + total_bases + total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
- batting_1B 1 0.017 1447.9 -993.47
- baserun_cs 1 0.768 1448.6 -992.29
- pitching_hr 1 0.786 1448.7 -992.26
- baserun_sb 1 0.990 1448.9 -991.94
- batting_3b 1 1.038 1448.9 -991.86
- batting_2b 1 1.228 1449.1 -991.56
<none> 1447.9 -991.49
- total_bases_allowed 1 1.301 1449.2 -991.45
- batting_h 1 1.672 1449.5 -990.87
- pitching_bb 1 2.362 1450.2 -989.78
- HR_over_OP 1 2.569 1450.4 -989.46
- total_bases 1 6.092 1454.0 -983.94
- pitching_h 1 7.152 1455.0 -982.28
- pitching_so 1 14.666 1462.5 -970.56
- free_bases_num 1 22.046 1469.9 -959.10
- batting_bb 1 26.489 1474.4 -952.23
- batting_so 1 42.066 1489.9 -928.31
- fielding_dp 1 68.348 1516.2 -888.51
- fielding_e 1 120.742 1568.6 -811.19
Step: AIC=-993.47
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + free_bases_num +
total_bases + total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
- baserun_cs 1 0.780 1448.7 -994.24
- pitching_hr 1 0.973 1448.9 -993.94
- baserun_sb 1 1.061 1449.0 -993.80
<none> 1447.9 -993.47
- batting_3b 1 1.979 1449.9 -992.36
- total_bases_allowed 1 2.038 1449.9 -992.27
- pitching_bb 1 2.378 1450.3 -991.73
- HR_over_OP 1 2.872 1450.8 -990.96
- batting_2b 1 2.911 1450.8 -990.90
- batting_h 1 4.817 1452.7 -987.91
- total_bases 1 6.238 1454.1 -985.68
- pitching_h 1 10.187 1458.1 -979.51
- pitching_so 1 15.166 1463.0 -971.75
- free_bases_num 1 24.275 1472.2 -957.62
- batting_bb 1 29.578 1477.5 -949.44
- batting_so 1 42.122 1490.0 -930.20
- fielding_dp 1 69.730 1517.6 -888.41
- fielding_e 1 121.012 1568.9 -812.78
Step: AIC=-994.24
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + pitching_h + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + free_bases_num +
total_bases + total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
- pitching_hr 1 0.903 1449.6 -994.82
<none> 1448.7 -994.24
- total_bases_allowed 1 1.657 1450.3 -993.64
- baserun_sb 1 2.039 1450.7 -993.04
- batting_3b 1 2.356 1451.0 -992.54
- pitching_bb 1 2.363 1451.0 -992.53
- HR_over_OP 1 2.705 1451.4 -992.00
- batting_2b 1 2.983 1451.7 -991.56
- batting_h 1 4.817 1453.5 -988.69
- total_bases 1 6.114 1454.8 -986.66
- pitching_h 1 9.596 1458.3 -981.22
- pitching_so 1 15.573 1464.2 -971.91
- free_bases_num 1 24.033 1472.7 -958.79
- batting_bb 1 29.558 1478.2 -950.27
- batting_so 1 42.228 1490.9 -930.85
- fielding_dp 1 71.951 1520.6 -885.92
- fielding_e 1 120.362 1569.0 -814.59
Step: AIC=-994.82
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + pitching_h + pitching_bb + pitching_so +
fielding_e + fielding_dp + free_bases_num + total_bases +
total_bases_allowed + HR_over_OP
Df Sum of Sq RSS AIC
<none> 1449.6 -994.82
- total_bases_allowed 1 1.421 1451.0 -994.60
- HR_over_OP 1 1.859 1451.4 -993.91
- pitching_bb 1 2.025 1451.6 -993.65
- batting_2b 1 2.113 1451.7 -993.51
- batting_3b 1 7.911 1457.5 -984.44
- pitching_h 1 9.605 1459.2 -981.79
- total_bases 1 11.184 1460.8 -979.33
- baserun_sb 1 14.380 1464.0 -974.36
- pitching_so 1 15.504 1465.1 -972.61
- batting_h 1 15.586 1465.2 -972.48
- free_bases_num 1 23.876 1473.5 -959.64
- batting_bb 1 28.848 1478.4 -951.97
- batting_so 1 51.216 1500.8 -917.80
- fielding_dp 1 71.451 1521.0 -887.32
- fielding_e 1 120.809 1570.4 -814.63
par(mfrow = c(2, 2))
plot(stepwise_base_model_bw)
summary(stepwise_base_model_bw)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + pitching_h + pitching_bb +
pitching_so + fielding_e + fielding_dp + free_bases_num +
total_bases + total_bases_allowed + HR_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-3.7578 -0.5110 -0.0044 0.5140 3.2705
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.900e-11 1.679e-02 0.000 1.000000
batting_h 2.724e-01 5.525e-02 4.929 8.85e-07 ***
batting_2b -5.554e-02 3.060e-02 -1.815 0.069640 .
batting_3b 1.149e-01 3.272e-02 3.512 0.000453 ***
batting_bb 9.685e-01 1.444e-01 6.706 2.51e-11 ***
batting_so -3.603e-01 4.032e-02 -8.936 < 2e-16 ***
baserun_sb 1.612e-01 3.404e-02 4.735 2.33e-06 ***
pitching_h -2.747e-01 7.099e-02 -3.870 0.000112 ***
pitching_bb -6.647e-02 3.741e-02 -1.777 0.075761 .
pitching_so 1.534e-01 3.120e-02 4.917 9.44e-07 ***
fielding_e -5.182e-01 3.776e-02 -13.724 < 2e-16 ***
fielding_dp -2.421e-01 2.294e-02 -10.554 < 2e-16 ***
free_bases_num -9.498e-01 1.557e-01 -6.101 1.23e-09 ***
total_bases 3.222e-01 7.715e-02 4.176 3.08e-05 ***
total_bases_allowed 8.224e-02 5.526e-02 1.488 0.136825
HR_over_OP -4.188e-02 2.460e-02 -1.702 0.088816 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8009 on 2260 degrees of freedom
Multiple R-squared: 0.3628, Adjusted R-squared: 0.3586
F-statistic: 85.79 on 15 and 2260 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_bw))
[1] "MSE equal 0.636893215447998"
Applying the transformations clearly made a difference, as the residual plots show. The residuals are now approximately normal (per the Q-Q plot), and the Residuals vs Fitted plot shows no pattern. The both-direction and backward searches arrive at the same 15-predictor model (AIC = -994.82). Taking the R-squared and adjusted R-squared together with the residual plots, it’s easy to conclude that the stepwise approach combined with the transformations leads to the better model.
Though the RMSE and R-squared from the other models seem to suggest otherwise, the stepwise model appears to be more stable. The Cook’s Distance plot also shows influential observations, but I could not get robust regression to work. As I understand it, robust regression would put less emphasis on those data points, leading to a more accurate model.
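For reference, here is a minimal sketch of robust regression with MASS::rlm, using the built-in stackloss data as a stand-in. One possible reason the fit failed here, offered as an assumption rather than a diagnosis: rlm stops with an error when the design matrix is rank deficient, and the full model above has two coefficients that are not defined because of singularities. Fitting on the stepwise-selected formula, which has no aliased terms, may avoid that.

```r
library(MASS)

# Stand-in fits on built-in data (NOT the moneyball data)
ols_fit    <- lm(stack.loss ~ ., data = stackloss)
robust_fit <- rlm(stack.loss ~ ., data = stackloss)  # Huber M-estimation by default

# Final IWLS weights: observations with large residuals are down-weighted (w < 1)
weights_tbl <- data.frame(
  row     = seq_len(nrow(stackloss)),
  cooks_d = cooks.distance(ols_fit),
  huber_w = robust_fit$w
)

# Observations given the least weight by the robust fit
print(head(weights_tbl[order(weights_tbl$huber_w), ]))

# Side-by-side coefficients: ordinary least squares vs robust fit
print(cbind(ols = coef(ols_fit), robust = coef(robust_fit)))
```

The weights column shows how much each observation contributes to the robust fit, which is one way to see the "less emphasis" on outlying points described above.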